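Metabolomics data were filtered to remove noise and identify significant peaks. Three approaches were compared: a cumulative 80% threshold, a cumulative 90% threshold, and a minimum contribution of 0.01% per peak.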
Root samples have a more concentrated signal than leaf samples: fewer peaks make up 80% of the total.
| Comparison | Count |
|---|---|
| In both 90% and ≥0.01% | 5,819 |
| Only in 90% | 958 |
| Only in ≥0.01% | 389 |
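
Adding up the rows, the 90% list holds 5,819 + 958 = 6,777 peaks and the ≥0.01% list holds 5,819 + 389 = 6,208. Counts like these can be produced with plain set operations. The following is a minimal sketch, not the code used for this report; the function name `compare_peak_sets` and the compound IDs are hypothetical.

```python
def compare_peak_sets(set_a, set_b):
    """Count peaks shared by two filtered lists and peaks unique to each."""
    a, b = set(set_a), set(set_b)
    return {
        "in_both": len(a & b),   # intersection
        "only_a": len(a - b),    # in the first list only
        "only_b": len(b - a),    # in the second list only
    }


# Hypothetical compound IDs, for illustration only.
kept_90 = {"3.90_564.1489n", "1.20_180.0634n", "5.10_300.1000m/z"}
kept_pct = {"3.90_564.1489n", "7.45_412.2000n"}

overlap = compare_peak_sets(kept_90, kept_pct)
print(overlap)  # {'in_both': 1, 'only_a': 2, 'only_b': 1}
```
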
What this means:
Each row shows one tissue sample and how many peaks were needed to account for 80% of its total signal. Samples with fewer peaks needed have more concentrated signal (dominated by a few compounds).
Leaf samples (IDs ending in L):

| # | ID | Peaks Needed | Smallest Peak Kept |
|---|---|---|---|
| 1 | AL | 1,317 | 0.0118% |
| 2 | BL | 547 | 0.0258% |
| 3 | CL | 692 | 0.0217% |
| 4 | DL | 1,317 | 0.0121% |
| 5 | EL | 664 | 0.0216% |
| 6 | FL | 1,303 | 0.0124% |
| 7 | GL | 756 | 0.0204% |
| 8 | HL | 1,289 | 0.0124% |
| 9 | IL | 692 | 0.0205% |
| 10 | JL | 568 | 0.0248% |
| 11 | KL | 1,217 | 0.0129% |
| 12 | LL | 676 | 0.0218% |
| 13 | ML | 598 | 0.0245% |
| 14 | NL | 1,314 | 0.0123% |
| 15 | OL | 747 | 0.0206% |
| 16 | PL | 661 | 0.0225% |
| 17 | QL | 1,376 | 0.0114% |
| 18 | RL | 614 | 0.0230% |
| 19 | SL | 1,378 | 0.0117% |
| 20 | TL | 519 | 0.0271% |
| 21 | UL | 1,143 | 0.0131% |

Root samples (IDs ending in R):

| # | ID | Peaks Needed | Smallest Peak Kept |
|---|---|---|---|
| 1 | AR | 442 | 0.0332% |
| 2 | BR | 276 | 0.0445% |
| 3 | CR | 368 | 0.0368% |
| 4 | DR | 448 | 0.0338% |
| 5 | ER | 515 | 0.0264% |
| 6 | FR | 334 | 0.0360% |
| 7 | GR | 433 | 0.0325% |
| 8 | HR | 172 | 0.0678% |
| 9 | IR | 372 | 0.0392% |
| 10 | JR | 478 | 0.0297% |
| 11 | KR | 264 | 0.0446% |
| 12 | LR | 244 | 0.0473% |
| 13 | MR | 496 | 0.0304% |
| 14 | NR | 302 | 0.0441% |
| 15 | PR | 225 | 0.0469% |
| 16 | QR | 126 | 0.0706% |
| 17 | RR | 428 | 0.0341% |
| 18 | SR | 400 | 0.0413% |
| 19 | TR | 426 | 0.0339% |
| 20 | UR | 154 | 0.0727% |
Example: AL (Leaf tissue from tree A) needed 1,317 peaks to reach 80% of its signal. The smallest peak kept contributed 0.0118% - anything contributing less was filtered as noise.
Example: UR (Root tissue from tree U) only needed 154 peaks because its signal is concentrated in fewer compounds. Only peaks contributing at least 0.0727% made the cut.
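
The two table columns, "Peaks Needed" and "Smallest Peak Kept", can be derived for one sample roughly as follows. This is a sketch under assumptions: the function name `summarize_sample` and the area values are made up for illustration, and the real pipeline may handle ties or zero-area peaks differently.

```python
def summarize_sample(areas, target=0.80):
    """Return (peaks_needed, smallest_kept_pct) for one sample.

    areas  -- peak areas for one sample, in any order
    target -- fraction of the total signal to account for
    """
    total = sum(areas)
    if total <= 0:
        return 0, 0.0
    cumulative = 0.0
    peaks_needed = 0
    smallest = 0.0
    # Walk the peaks from largest to smallest until the target is reached.
    for area in sorted(areas, reverse=True):
        cumulative += area
        peaks_needed += 1
        smallest = area
        if cumulative >= target * total:
            break
    return peaks_needed, 100.0 * smallest / total


# Hypothetical areas: a spread-out profile and a concentrated one.
spread = [30, 20, 15, 10, 10, 5, 5, 3, 2]
concentrated = [70, 15, 5, 5, 3, 2]

print(summarize_sample(spread))        # (5, 10.0)
print(summarize_sample(concentrated))  # (2, 15.0)
```

The concentrated profile needs fewer peaks and ends on a larger "smallest peak", which is the same pattern the root samples show relative to the leaf samples.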
Cumulative threshold method: the requested approach for filtering noise from the metabolomics data.
1. Take all peaks and their area values for one sample
2. Sort peaks from LARGEST to SMALLEST
3. Add up the areas as you go down the list
4. Stop when you've added up 80% (or 90%) of the total
5. Everything above that line is kept
Important: A compound is kept if it makes the cut in ANY sample.
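
The steps above, together with the "ANY sample" rule, can be sketched in a few lines. This is an illustrative sketch rather than the code behind the report: the function names and the area values are assumptions, and only the sample IDs AL and AR come from the tables.

```python
def cumulative_keep(sample, target=0.80):
    """Return the compound IDs kept for one sample.

    sample -- dict mapping compound ID -> peak area
    target -- fraction of total signal at which to stop (0.80 or 0.90)
    """
    total = sum(sample.values())
    kept = set()
    if total <= 0:
        return kept
    cumulative = 0.0
    # Largest peaks first; the peak that crosses the target is still kept.
    for compound, area in sorted(sample.items(), key=lambda kv: kv[1], reverse=True):
        kept.add(compound)
        cumulative += area
        if cumulative >= target * total:
            break
    return kept


def keep_in_any_sample(samples, target=0.80):
    """Union of per-sample results: kept if it makes the cut in ANY sample."""
    kept = set()
    for sample in samples.values():
        kept |= cumulative_keep(sample, target)
    return kept


# Hypothetical areas for two samples.
samples = {
    "AL": {"c1": 60.0, "c2": 30.0, "c3": 10.0},
    "AR": {"c1": 5.0, "c2": 5.0, "c3": 90.0},
}

print(sorted(cumulative_keep(samples["AL"])))  # ['c1', 'c2']
print(sorted(keep_in_any_sample(samples)))     # ['c1', 'c2', 'c3']
```

Here `c3` misses the cut in AL but dominates AR, so it survives in the combined list.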
Percentage threshold method (the alternative approach):
1. Take all peaks and their area values for one sample
2. Add up all areas to get the total
3. For each peak: what percentage of the total is this?
4. If it's at least 0.01%, keep it
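
A minimal sketch of this percentage check for a single sample follows. The function name `percent_keep` and the area values are hypothetical, chosen only to show one peak falling below the 0.01% line.

```python
def percent_keep(sample, min_pct=0.01):
    """Return compound IDs whose share of the sample total is >= min_pct percent.

    sample  -- dict mapping compound ID -> peak area
    min_pct -- minimum contribution, expressed in percent of total signal
    """
    total = sum(sample.values())
    if total <= 0:
        return set()
    return {
        compound
        for compound, area in sample.items()
        if 100.0 * area / total >= min_pct
    }


# Hypothetical areas; c3 contributes 0.005% of the total and is dropped.
sample = {"c1": 9000.0, "c2": 999.5, "c3": 0.5}

print(sorted(percent_keep(sample)))  # ['c1', 'c2']
```

Unlike the cumulative method, each peak is judged on its own share, so the result does not depend on how many larger peaks come before it.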
The algorithm keeps adding peaks until the cumulative sum crosses 80%. The last peak kept is the one that pushes you over the threshold:
| Peak # | % Contribution | Cumulative | Status |
|---|---|---|---|
| 1314 | 0.01182% | 79.9527% | KEPT |
| 1315 | 0.01181% | 79.9645% | KEPT |
| 1316 | 0.01180% | 79.9763% | KEPT |
| 1317 | 0.01179% | 79.9881% | KEPT |
| 1318 | 0.01178% | 79.9999% | KEPT |
| 1319 | 0.01177% | 80.0117% | KEPT ← crossed 80% |
| 1320 | 0.01176% | 80.0234% | FILTERED |
| 1321 | 0.01175% | 80.0352% | FILTERED |
Peak 1319 is the last one kept because it's the peak that pushed the cumulative total past 80%. Peak 1320 contributes almost the same amount (0.01176% vs 0.01177%) but is filtered because we already crossed the threshold. This is the "arbitrary cutoff" - two nearly identical peaks get different treatment based on where 80% happened to fall.
80% cumulative list: peaks that contribute to 80% of the signal in at least one tissue sample. This is the requested filtering threshold.
90% cumulative list: a more conservative threshold that keeps peaks contributing to 90% of the signal. Use this if 80% seems too aggressive.
≥0.01% list: an alternative approach that keeps any peak contributing at least 0.01% of the signal in any tissue sample. It avoids arbitrary rank-based cutoffs.
Each compound is identified by a code like 3.90_564.1489n. This encodes two measurements:
| Part | Example | Meaning |
|---|---|---|
| First number | 3.90 | Retention time (minutes) - how long it took to pass through the column |
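| Second number | 564.1489 | Mass - the neutral mass or mass-to-charge ratio (m/z) detected, depending on the suffix |
| Suffix | n or m/z | Notation for how the mass is reported: n conventionally marks a neutral mass, m/z a mass-to-charge ratio |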
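
For downstream work it can help to split a compound code into its parts. The sketch below assumes every code follows the `<retention time>_<mass><suffix>` pattern shown in the table, with the suffix being either `n` or `m/z`; the function name `parse_compound_id` is hypothetical.

```python
import re

# Assumed pattern: retention time, underscore, mass, then "n" or "m/z".
COMPOUND_ID = re.compile(
    r"^(?P<rt>\d+(?:\.\d+)?)_(?P<mass>\d+(?:\.\d+)?)(?P<suffix>n|m/z)$"
)


def parse_compound_id(code):
    """Split a compound code into (retention_time_min, mass, suffix)."""
    match = COMPOUND_ID.match(code)
    if match is None:
        raise ValueError(f"Unrecognized compound ID: {code!r}")
    return float(match["rt"]), float(match["mass"]), match["suffix"]


print(parse_compound_id("3.90_564.1489n"))  # (3.9, 564.1489, 'n')
```

Codes that do not match the assumed pattern raise a `ValueError` rather than being silently skipped, so unexpected formats surface early.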